Final Executive Summary

The  goal  of  this  effort  has  been  to  develop  a  set  of  mathematical  and  computational  tools  for 
describing,  analyzing,  computing  and  exploiting  relationships  and  mappings  between  geometric 
data  sets,  both  pairwise  and  in  higher-order  combinations,  or  in  loose  networks  of  interrelated 
sets.  The  objective  is  to  analyze  geometric  data  sets  jointly,  organizing  data  collections  into 
(possibly  overlapping)  groups  of  related  sets  or  parts  thereof,  separating  what  is  common  from 
what  is  variable  within  each  group  and  across  groups,  and  understanding  the  main  axes  of 
variability. 

The  basic  thesis  of  the  work  has  been  that  geometric  data  sets  are  best  understood  not  in 
isolation  but  within  a  "social  network"  of  related  data  sets  and  their  associated  maps  and 
correspondences.  These  relational  networks  can  interconnect  data  sets  into  societies  where  the 
"wisdom  of  the  collection"  can  be  exploited  in  performing  operations  on  individual  data  sets 
better,  or  in  further  assessing  relationships  between  them.  By  creating  such  societies  of  data 
sets  and  their  associations  in  a  globally  consistent  way,  we  enable  a  certain  joint  understanding 
that  provides  the  powers  of  abstraction,  analogy,  compression,  error  correction,  and 
summarization  over  the  data. 

For  example,  given  a  collection  of  images  with  shared  content  across  a  number  of  object 
categories  (e.g.,  airplanes),  our  network  analysis  techniques  are  able  to  learn  the  shared 
categories  and  discover  the  object(s)  in  each  category  contained  in  each  image.  Furthermore, 
this is accomplished in a fully unsupervised manner, and the results surpass some state-of-the-art
methods that use supervision. Of course, supervision can be added to further improve the
results.

Of particular interest this past year have been algorithms for relating and interconnecting diverse
modalities  that  provide  information  about  objects  in  the  world,  including  images,  sketches,  3D 
scans,  3D  models,  and  language.  Different  modalities  often  capture  distinct  types  of 
information  about  the  nature  and  state  of  objects  in  the  world,  so  that  the  information  fusion 
made  possible  by  this  integration  creates  new  integrated  knowledge  not  available  separately 
from  any  of  the  modalities. 

Unlike  traditional  data  fusion,  in  our  setting  fusion  is  possible  not  only  at  the  level  of  object 
instances  but  also  across  object  categories,  through  the  abstraction  mechanisms  we  have 
developed  in  our  networks.  In  particular,  we  have  aimed  to  provide  additional  information  or 
knowledge about captured signals (e.g., images) in a real-time setting: information that is
not present in the raw signal but is inferred from contextual knowledge present in the
network. For example, when we see a chair partially occluded by a table, we can usually make a
good guess about what the occluded portions look like, because we may see other
identical  chairs  in  the  same  environment,  or  because  we  have  memory  of  having  seen  other 
similar  chairs  in  analogous  settings  in  the  past.  We  have  aimed  to  endow  computers  with 
exactly  this  ability  to  "imagine  the  unseen"  and  have  made  substantial  progress  on  this  front. 

Specific  Year  Accomplishments/Findings 

I.  Common  Embedding  Spaces  for  Multimodal  Data,  Combining  Visual  Representations  and 
Language 

We  have  developed  a  new  method  for  structuring  multi-modal  representations  of  shapes 
according  to  semantic  relations.  We  learn  a  metric  that  links  semantically  similar  objects 
represented  in  different  modalities.  First,  3D-shapes  are  associated  with  textual  labels  by 
learning  how  textual  attributes  are  related  to  the  observed  geometry.  Correlations  between 
similar  labels  are  captured  by  simultaneously  embedding  labels  and  shape  descriptors  into  a 
common  latent  space  in  which  an  inner  product  corresponds  to  similarity.  The  mapping  is 
learned  robustly,  by  optimizing  a  rank-based  loss  function  under  a  sparseness  prior  for  the 
spectrum of the matrix of all classifiers. Second, we extend this framework to relating
multi-modal representations of the geometric objects.

The  key  idea  is  that  weak  cues  from  shared  human  labels  are  sufficient  to  obtain  a  consistent 
embedding  of  related  objects  even  though  their  representations  are  not  directly  comparable. 
Technically,  we  accomplish  the  assignment  of  labels  to  3D  geometry  by  learning  a  low-rank 
classifier  matrix  that  recognizes  similarities  of  labels  through  correlations  in  shape.  This  permits 
information  sharing  across  geometrically  similar  objects  as  well  as  semantically  related  labels. 
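As a rough illustration of the low-rank classifier idea, the sketch below factors the classifier matrix into a label embedding and a shape projection, so that label scores become inner products in a shared latent space. All names and numbers here (LABEL_EMBED, SHAPE_PROJ, the 3-dimensional descriptor, and the hinge-style ranking loss) are hypothetical and chosen only for illustration; they are not the learned quantities or the exact loss used in this work.

```python
# Toy sketch of scoring labels for a shape in a shared latent space.
# All values below are made up for illustration, not learned from data.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

# Hypothetical 2-D latent embeddings of three text labels.
LABEL_EMBED = {
    "chair": [1.0, 0.0],
    "armchair": [0.9, 0.1],
    "airplane": [0.0, 1.0],
}

# Hypothetical 2x3 projection from a 3-D shape descriptor to the latent space.
# The implied classifier matrix (label embeddings times this projection)
# has rank at most 2, so it is low-rank by construction.
SHAPE_PROJ = [
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]

def label_scores(descriptor):
    """Score every label as an inner product with the projected descriptor."""
    latent = matvec(SHAPE_PROJ, descriptor)
    return {label: dot(embed, latent) for label, embed in LABEL_EMBED.items()}

def ranking_hinge_loss(scores, positives, margin=1.0):
    """Generic pairwise ranking loss (an assumed stand-in): penalize every
    negative label that is not scored at least `margin` below a positive one."""
    negatives = [label for label in scores if label not in positives]
    return sum(
        max(0.0, margin - scores[p] + scores[n])
        for p in positives
        for n in negatives
    )

scores = label_scores([1.0, 1.0, 0.0])
ranked = sorted(scores, key=scores.get, reverse=True)
loss = ranking_hinge_loss(scores, positives={"chair", "armchair"})
```

Because related labels such as "chair" and "armchair" sit close together in the latent space, a descriptor that scores well for one tends to score well for the other, which is the kind of information sharing described above.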

In  experiments,  we  can  clearly  see  an  advantage  in  performance  over  baseline  methods  that 
ignore  this  side  information.  Moreover,  we  have  generalized  the  idea  of  multi-label 
classification  through  a  low-dimensional  latent  space  to  obtain  a  novel  cross-modal  embedding 
of  objects.  This  can  be  used  for  object  retrieval  across  different  modalities  and  for  interactive 
explorations  of  complex  data  spaces. 

We have evaluated our method against common baseline approaches, investigated the
influence  of  different  geometric  descriptors,  and  demonstrated  a  prototypical  multi-modal 
browser  that  relates  3D-objects  with  text,  photographs,  and  2D  line  sketches. 

This  work  has  appeared  at  the  2015  Eurographics  Symposium  on  Geometry  Processing. 

II.  The  ShapeNet  Repository 

In  order  to  develop  and  test  our  shape  mapping  algorithms  at  scale,  we  have  initiated  an  effort 
to  collect  and  annotate  a  large  corpus  of  3D  CAD  models  that  we  call  ShapeNet 
(http://shapenet.cs.stanford.edu).  This  is  a  joint  effort  with  Professors  Pat  Hanrahan  and  Silvio 
Savarese at Stanford, and Tom Funkhouser and Jianxiong Xiao at Princeton. The repository
contains  3D  models  from  a  multitude  of  semantic  categories  and  organizes  them  under  the 
WordNet  taxonomy.  In  addition  to  categories,  ShapeNet  currently  provides  consistent  rigid 
alignments  and  bilateral  symmetry  planes  for  each  3D  model.  These  annotations,  as  well  as 
other  planned  semantic  annotations,  are  made  available  through  a  public  web-based  interface 
to  enable  data  visualization  of  object  attributes,  promote  data-driven  geometric  analysis,  and 
provide  a  large-scale  quantitative  benchmark  for  research  in  computer  graphics  and  vision. 
Planned  annotations  include  object  parts  and  part  names,  local  as  well  as  global  symmetries, 
physical  properties  such  as  size  and  weight,  and  affordances  /  functionality  (how  the  shape  is 
used).  Maps  and  correspondences  between  shapes  will  be  included,  as  well  as  between  shapes 
and  images. 

ShapeNet  aims  to  fill  a  large  gap  in  the  3D  repositories  that  currently  exist,  which  are  either 
large  (e.g.,  the  Trimble  3D  Warehouse,  2.5M  shapes)  but  poorly  annotated,  or  annotated  but 
small (e.g., the Princeton Shape Benchmark, 1.8K shapes). Recently, large data sets of images,
such as ImageNet (Deng et al. 2009, 14M images organized into 20K categories associated with
"synsets"  of  WordNet)  have  played  a  major  role  in  key  data-driven  advances  in  computer 
vision,  in  part  by  providing  rich  training  data  for  machine  learning  algorithms.  The  same  has 
occurred  with  natural  language  processing  (NLP  -  e.g.,  in  machine  translation)  and  our  belief  is 
that  similar  breakthroughs  can  happen  with  3D  data. 

This is a seed effort intended to lead to a separately supported project.

III.  Shape  Completion  for  Incomplete  3D  Scans 

Acquiring  3D  geometry  of  an  object  is  a  tedious  and  time-consuming  task,  typically  requiring 
scanning  the  surface  from  multiple  viewpoints.  In  work  this  past  year  we  focused  on 
reconstructing complete geometry from a single scan acquired with a low-quality consumer-level
scanning device, even in the presence of significant occlusions (and of course
self-occlusions). Our method is class-based and uses a network of example 3D shapes to build
structural  part-based  priors  that  are  necessary  to  complete  shapes  in  that  class.  In  our 
representation,  we  associate  a  local  coordinate  system  to  each  part  and  then  learn  the 
distribution  of  positions  and  orientations  of  all  the  other  parts  from  the  network,  which 
implicitly  also  defines  the  positions  of  symmetry  planes  and  symmetry  axes.  At  the  inference 
stage,  this  knowledge  can  be  transported  to  the  new  scan  and  used  to  analyze  incomplete  point 
clouds  with  substantial  occlusions,  because  observing  only  a  few  regions  is  still  sufficient  to 
infer  the  global  structure.  Once  the  parts  and  the  symmetries  are  estimated,  both  data  sources, 
symmetry  and  database,  are  fused  to  complete  the  point  cloud,  providing  much  better  results 
than  either  of  them  alone  could. 
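The symmetry half of this fusion can be sketched in a few lines, assuming the symmetry plane has already been estimated. The function names, the tolerance, and the toy points below are hypothetical, and the database-driven half of the completion is not shown.

```python
# Minimal sketch: fill in a partial point cloud by mirroring observed points
# across an (assumed already estimated) bilateral symmetry plane.

def reflect(point, normal, offset):
    """Reflect a 3D point across the plane {x : normal . x = offset}.
    `normal` is assumed to be unit length."""
    signed_dist = sum(p * n for p, n in zip(point, normal)) - offset
    return tuple(p - 2.0 * signed_dist * n for p, n in zip(point, normal))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def complete_by_symmetry(points, normal, offset, tol=1e-6):
    """Return the observed points plus mirrored copies that are not already
    present (within squared distance `tol`)."""
    completed = [tuple(float(c) for c in p) for p in points]
    for p in list(completed):
        mirrored = reflect(p, normal, offset)
        if all(squared_distance(mirrored, q) > tol for q in completed):
            completed.append(mirrored)
    return completed

# Toy partial scan: only the x >= 0 side of the object was observed.
partial_scan = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
completed = complete_by_symmetry(partial_scan, normal=(1.0, 0.0, 0.0), offset=0.0)
```

Points lying on the plane map to themselves and are not duplicated, so only the two off-plane points gain mirrored counterparts.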

Our  main  technical  contribution  is  a  data-driven  technique  for  estimating  shape  structure  from 
incomplete  point  clouds.  The  key  difference  from  previous  approaches  is  that  our  method  does 
not  rely  on  a  global  coordinate  system  —  instead  every  part  defines  local  coordinates,  and  then 
all  parts  are  jointly  optimized  to  find  their  most  plausible  arrangement.  This  enables  the 
prediction of parts in occluded regions, and the estimation of symmetries even if the input
partial  scan  is  asymmetric  due  to  occlusions. 
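To make the role of local coordinates concrete, the following simplified 2D sketch lets each observed part "vote" for the position of an occluded part using an offset expressed in its own local frame. The part names, frames, and offsets are hypothetical, and plain averaging of votes stands in for the joint optimization described above.

```python
import math

def to_world(frame, local_offset):
    """Map an offset given in a part's local frame to world coordinates (2D).
    A frame is ((origin_x, origin_y), rotation_angle_in_radians)."""
    (ox, oy), theta = frame
    c, s = math.cos(theta), math.sin(theta)
    lx, ly = local_offset
    return (ox + c * lx - s * ly, oy + s * lx + c * ly)

def predict_hidden_part(observed_frames, learned_offsets):
    """Average the positions predicted by every observed part.
    Averaging is an assumed simplification of a joint optimization."""
    votes = [
        to_world(frame, learned_offsets[name])
        for name, frame in observed_frames.items()
    ]
    n = len(votes)
    return (sum(v[0] for v in votes) / n, sum(v[1] for v in votes) / n)

# Hypothetical observed parts of a chair and their local frames.
observed_frames = {
    "seat": ((0.0, 0.0), 0.0),
    "back": ((0.0, 1.0), math.pi / 2),
}
# Hypothetical mean offsets to the occluded leg, each in its part's own frame.
learned_offsets = {
    "seat": (1.0, -1.0),
    "back": (-2.0, -1.0),
}
leg_position = predict_hidden_part(observed_frames, learned_offsets)
```

In this toy setup both parts agree on the leg position even though their frames are oriented differently, which illustrates why no global coordinate system is needed.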

We  have  evaluated  our  technique  on  a  synthetic  dataset  containing  481  shapes,  and  on  real 
scans  acquired  with  a  Kinect  scanner.  Our  method  demonstrates  high  accuracy  for  the 
estimated  part  structure  and  detected  symmetries,  enabling  higher  quality  shape  completions 
in comparison to all alternative techniques. Furthermore, we have publicly released our
benchmark  data  set  so  that  others  working  in  this  area  can  evaluate  their  methods  and 
compare  them  to  ours. 

This  work,  an  instance  of  "imagining  the  unseen,"  will  appear  at  Siggraph  Asia  2015. 

IV.  3D-Assisted  Feature  Synthesis  for  Novel  Views 

We  are  especially  interested  in  being  able  to  link  images  that  represent  views  of  the  same 
object  from  very  different  viewpoints.  Comparing  different  views  has  been  a  long-standing 
challenging problem in computer vision, as visual features are not stable under large viewpoint
changes.  In  work  this  past  year,  given  a  single  input  image  of  an  object,  we  have  developed 
tools  able  to  synthesize  its  features  for  other  views,  leveraging  an  existing,  modestly-sized  3D 
model  collection  of  related  but  not  identical  objects.  3D  models  can  provide  strong  prior 
information  to  help  an  algorithm  "imagine"  what  the  underlying  3D  object  should  look  like  from 
novel  views.  To  accomplish  this  feature  transport  to  new  views,  we  study  the  relationship  of 
image  patches  between  different  views  of  the  same  3D  object,  seeking  what  we  call  "surrogate 
patches"  —  patches  in  one  view  whose  feature  content  predicts  well  the  features  of  a  patch  in 
another  view.  These  surrogate  relationships  are  learned  from  the  analysis  of  a  co-aligned  set  of 
3D  models  in  a  given  class.  Note  that,  indirectly,  these  surrogate  patch  relationships  encode 
geometric  global  or  local  symmetries  of  the  underlying  3D  models,  without  having  to  first 
estimate  the  latter. 

When an image of an object is provided, we first estimate its pose and then develop local linear
models  for  predicting  its  features  using  features  from  the  same  views  of  our  3D  models.  We 
finally  use  our  surrogate  relationships  for  transferring  the  same  linear  combination  to  estimate 
features  for  the  new  view.  Based  upon  these  surrogate  relationships,  we  can  in  fact  create 
feature sets for all views of the latent object on a per-patch basis, providing us with an augmented
view-independent  representation  of  the  object.  We  note  that  our  method  can  work  with  many 
common  image  feature  sets,  including  HoG,  CNN,  etc. 
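The transfer of a linear combination from one view to another can be sketched as follows. This is a minimal illustration under stated assumptions: feature vectors are tiny toy lists, the weights come from a ridge-regularized least-squares fit, and the helper names (fit_weights, combine, solve) are hypothetical rather than taken from the actual system.

```python
# Sketch: express an observed patch feature as a linear combination of the
# corresponding 3D-model features in the same view, then reuse the same
# weights on the models' features in a new view.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_weights(target, basis, ridge=1e-9):
    """Least-squares weights w such that target ~ sum_i w[i] * basis[i]."""
    k = len(basis)
    gram = [
        [dot(basis[i], basis[j]) + (ridge if i == j else 0.0) for j in range(k)]
        for i in range(k)
    ]
    rhs = [dot(basis[i], target) for i in range(k)]
    return solve(gram, rhs)

def combine(weights, basis):
    """Apply the weights to a (possibly different) set of basis features."""
    dim = len(basis[0])
    return [sum(w * b[d] for w, b in zip(weights, basis)) for d in range(dim)]

# Toy data: features of two 3D models in the input view and in a novel view.
models_input_view = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
models_novel_view = [[0.0, 1.0], [1.0, 1.0]]
observed_feature = [2.0, 3.0, 0.0]

weights = fit_weights(observed_feature, models_input_view)
synthesized = combine(weights, models_novel_view)
```

The small ridge term only keeps the normal equations well conditioned when model features are nearly collinear; it is a design choice of this sketch, not something stated in the report.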

The  method  allows  us  to  compare  images  of  objects  from  very  different  views.  In  addition  to 
demonstrating that we do much better in such cross-view comparisons than traditional image-based
methods, we have explored a number of other applications of our techniques. These
include  (a)  part-based  image  retrieval,  where  we  query  for  similar  images  within  a  specified 
region,  aiming  again  at  view  independent  results,  (b)  fine-grained  image  retrieval  and  object 
categorization,  and  (c)  instance  retrieval  (looking  for  exactly  the  same  object  in  other  images). 
This work, a second instance of "imagining the unseen," will appear at ICCV 2015 (oral).


V.  Analyzing  and  Using  the  Shape  of  White  Matter  Brain  Structures 

In  recent  years,  a  focus  of  the  neuroscience  community  has  been  to  understand  the  role  of 
white  matter  in  human  cognition  and  function.  We  have  developed  a  set  of  tools  to  study  the 
3D shape and shape variability of major human white matter tracts. The first is a mapping tool that
extracts the skeleton of a tract and performs fine-grained spatial clustering to identify
correspondences between areas of tracts from different brains. This tool allows comparison of
tissue properties at a fine scale between individuals and mapping of the 3D morphology of these
structures across large human populations, so that the available neuroimaging datasets can be
explored in a principled manner. The second is a synthesis and simulation tool that applies a set
of simple shape deformations, such as bending and shearing, to a given tract. This enables
parametrization of the underlying shape space and analysis of inter-individual normal and
pathological shape variability. This work is ongoing. We have tested the fine-mapping
tool on a cohort of subjects from the ADNI dataset and shown that we are able to accurately
map corresponding areas on a variety of tracts across individuals.
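As a small illustration of what a parametrized deformation of a tract might look like, the sketch below shears a tract represented as a 3D polyline. The representation, the function name, and the choice of shear axis are assumptions made for this example only; a bending deformation could be parametrized in an analogous way.

```python
# Sketch: apply a simple shear to a tract given as a list of 3D points.

def shear_tract(points, amount):
    """Shift x in proportion to z; y and z are left unchanged."""
    return [(x + amount * z, y, z) for x, y, z in points]

# Toy straight "tract" running along the z axis.
tract = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
sheared = shear_tract(tract, amount=0.5)

# Sweeping the parameter yields a one-parameter family of shapes.
family = {a: shear_tract(tract, a) for a in (0.0, 0.25, 0.5)}
```

Sampling the deformation parameter over a range gives a family of synthetic shapes whose variability is known exactly, which can then be compared against variability measured across individuals.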

Refereed  Publications  (past  year  only) 

Robert  Herzog,  Daniel  Mewes,  Michael  Wand,  Leonidas  Guibas,  Hans-Peter  Seidel.  LeSSS: 
Learned  Shared  Semantic  Spaces  for  Relating  Multi-Modal  Representations  of  3D  Shapes.  In: 
Computer  Graphics  Forum  (Proc.  SGP  2015),  2015. 

Minhyuk  Sung,  Vladimir  Kim,  Roland  Angst,  Leonidas  Guibas.  Data-Driven  Structural  Priors  for 
Shape Completion. ACM Transactions on Graphics (Proc. Siggraph Asia), vol. 34, no. 6, 2015 (to
appear). 

Hao  Su,  Fan  Wang,  Eric  Yi,  and  Leonidas  Guibas.  3D-Assisted  Feature  Synthesis  for  Novel  Views 
of  an  Object.  International  Conference  on  Computer  Vision  (ICCV),  Santiago,  Chile,  2015  (to 
appear). 

Personnel  Supported  (past  year  only) 

1.  Leonidas  Guibas,  Faculty  PI,  1.00  month 

2. Peter Huang, Postdoctoral Fellow, 0.30 month [ShapeNet repository]

3.  Vladimir  Kim,  Postdoctoral  Fellow,  7.25  months  [shape  completion] 

4.  Tany  Glozman,  Graduate  Student,  1.50  months  [white  matter  brain  structures] 

A Stanford undergraduate student (Ivan Robles) also worked with us this summer on topics
related  to  this  grant. 


DISTRIBUTION  A:  Distribution  approved  for  public  release. 


Education 


We  are  planning  a  new  course,  CS233:  The  Shape  of  Data  -  Geometric  and  Topological  Data 
Analysis,  based  to  a  significant  part  on  material  developed  under  this  grant.  This  course  will  be 
taught  at  Stanford  in  the  spring  of  2016. 

Interactions/Transitions 

The work described in this year's and in previous reports has been or will be presented in the very
top  venues  in  computer  vision,  computer  graphics  and  machine  learning.  Earlier  work  was  also 
presented  at  the  AFOSR  Computational  Cognition  annual  meetings. 

Several  companies  have  expressed  strong  interest  in  our  work.  We  are  currently  collaborating 
on  topics  related  to  this  grant  with  Adobe,  Apple,  Autodesk,  and  Google. 

PI Honors/Awards

•  2013  Eurographics  Symposium  on  Geometry  Processing,  Best  Paper  Award 

•  2013  International  Conference  on  Computer  Vision  Helmholtz  Award  (recognizes  ICCV 
papers  from  ten  years  ago  with  significant  impact  on  computer  vision  research) 